robot demonstration
Invariance Co-training for Robot Visual Generalization
Yang, Jonathan, Finn, Chelsea, Sadigh, Dorsa
Reasoning from diverse observations is a fundamental capability for generalist robot policies to operate in a wide range of environments. Despite recent advancements, many large-scale robotic policies still remain sensitive to key sources of observational variation--such as changes in camera perspective, lighting, and the presence of distractor objects. We posit that the limited generalizability of these models arises from the substantial diversity required to robustly cover these quasistatic axes, coupled with the current scarcity of large-scale robotic datasets that exhibit rich variation across them. In this work, we propose to systematically examine what robots need to generalize across these challenging axes by introducing two key auxiliary tasks--state similarity and invariance to observational perturbations--applied to both demonstration data and static visual data. We then show that via these auxiliary tasks, leveraging both more-expensive robotic demonstration data and less-expensive, visually rich synthetic images generated from non-physics-based simulation (e.g., Unreal Engine) can lead to substantial increases in generalization to unseen camera viewpoints, lighting configurations, and distractor conditions. Our results demonstrate that co-training on this diverse data improves performance by 18% over existing generative augmentation methods.

Robotic foundation models have shown impressive progress in generalizing to everyday scenarios by leveraging large-scale datasets spanning multiple embodiments, environments, and tasks [1], [2]. However, despite their breadth, the resulting models often remain brittle in real-world settings--failing to handle unseen spatial configurations of objects or adapt to drastic visual changes such as lighting and viewpoint shifts. We hypothesize that the brittleness of current robotic policies stems from insufficient coverage of key observational factors during training. For example, many large-scale datasets provide only one or two third-person perspectives per scene, limiting robustness to viewpoint shifts.
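The abstract leaves the auxiliary tasks at a high level; below is a minimal sketch, not the authors' code, of one plausible form of the state-similarity/invariance objective: embeddings of two renders of the same underlying state (e.g., under a different viewpoint, lighting, or distractor set) are pulled together, while other states in the batch are pushed apart. The VisionEncoder architecture, the InfoNCE form, and the temperature are illustrative assumptions.

# Sketch of an invariance auxiliary loss (assumed, not the paper's exact objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionEncoder(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def invariance_loss(encoder, obs, obs_perturbed, temperature=0.1):
    """InfoNCE-style state-similarity loss: matching (obs, perturbed-obs)
    pairs from the same state are positives; other pairs in the batch are negatives."""
    z_a = encoder(obs)              # (B, D)
    z_b = encoder(obs_perturbed)    # (B, D)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(obs.shape[0], device=obs.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    enc = VisionEncoder()
    obs = torch.randn(8, 3, 64, 64)                 # original renders
    obs_aug = obs + 0.05 * torch.randn_like(obs)    # stand-in for a viewpoint/lighting change
    print(invariance_loss(enc, obs, obs_aug).item())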
X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Pace, Maximus A., Dan, Prithwish, Ning, Chuanruo, Bhardwaj, Atiksh, Du, Audrey, Duan, Edward W., Ma, Wei-Chiu, Kedia, Kushal
Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.
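As a rough illustration of the noise-gating idea described above, the sketch below (assumptions throughout, not the released implementation) diffuses a human action under a standard DDPM schedule and finds the smallest noise level at which an embodiment classifier can no longer tell it apart from robot actions; policy training would then sample diffusion timesteps for that action only above this level. The classifier architecture, schedule constants, and the near-chance threshold are illustrative.

# Sketch of noise-level gating for human actions (assumed, not X-Diffusion's code).
import torch
import torch.nn as nn

T = 100                                  # diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def add_noise(action, t, noise):
    """Standard DDPM forward process q(a_t | a_0)."""
    ab = alpha_bars[t]
    return ab.sqrt() * action + (1.0 - ab).sqrt() * noise

class EmbodimentClassifier(nn.Module):
    """Predicts P(robot | noisy action, t); assumed pre-trained in practice."""
    def __init__(self, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(action_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, noisy_action, t):
        t_feat = torch.full_like(noisy_action[..., :1], float(t) / T)
        return torch.sigmoid(self.net(torch.cat([noisy_action, t_feat], dim=-1)))

def min_indistinguishable_t(classifier, human_action, p_human_threshold=0.55):
    """Smallest noise level t at which the classifier's estimated probability
    that the noised action came from a human drops to (near-)chance."""
    for t in range(T):
        noisy = add_noise(human_action, t, torch.randn_like(human_action))
        p_human = 1.0 - classifier(noisy, t).item()   # classifier outputs P(robot)
        if p_human < p_human_threshold:
            return t
    return T  # never indistinguishable: exclude this action from training

clf = EmbodimentClassifier(action_dim=7)
human_action = torch.randn(1, 7)
t_min = min_indistinguishable_t(clf, human_action)
# During policy training, sample timesteps for this human action only from
# [t_min, T), so it supervises coarse, high-noise denoising.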
ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation
Liu, Yangcen, Shin, Woo Chul, Han, Yunhai, Chen, Zhenyang, Ravichandar, Harish, Xu, Danfei
Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, domain gaps across visual, morphological, and physical aspects hinder direct imitation. To effectively bridge the domain gap, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small amount of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy in bridging the domain gap for robust robot manipulation. The project website can be found at https://sites.google.com/view/immimic.
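The two mechanisms named in the abstract, DTW mapping and MixUp interpolation, admit a compact illustration. Below is a minimal sketch (not the ImMimic release; the trajectory shapes and Beta-distributed mixing weights are assumptions) that aligns a retargeted human trajectory with a robot joint trajectory via classic DTW and then interpolates the aligned pairs into intermediate-domain samples.

# Sketch of DTW pairing + MixUp interpolation (illustrative assumptions).
import numpy as np

def dtw_pairs(human_traj, robot_traj):
    """Classic DTW over per-step Euclidean cost; returns aligned index pairs."""
    n, m = len(human_traj), len(robot_traj)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(human_traj[i - 1] - robot_traj[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                      # backtrack along the optimal path
        pairs.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def mixup_samples(human_traj, robot_traj, pairs, alpha=0.2, rng=np.random):
    """Interpolate each aligned (human, robot) pair with a Beta-sampled weight,
    yielding intermediate-domain actions for co-training."""
    lam = rng.beta(alpha, alpha, size=len(pairs))
    return np.stack([l * human_traj[i] + (1 - l) * robot_traj[j]
                     for l, (i, j) in zip(lam, pairs)])

if __name__ == "__main__":
    human = np.random.randn(50, 7)   # retargeted human hand poses mapped to joint space
    robot = np.random.randn(40, 7)   # teleoperated robot joint trajectory
    mixed = mixup_samples(human, robot, dtw_pairs(human, robot))
    print(mixed.shape)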
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Yang, Ruihan, Yu, Qinxi, Wu, Yecheng, Yan, Rui, Li, Borui, Cheng, An-Chieh, Zou, Xueyan, Fang, Yunhao, Cheng, Xuxin, Qiu, Ri-Zhao, Yin, Hongxu, Liu, Sifei, Han, Song, Lu, Yao, Wang, Xiaolong
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of human videos lies not only in their scale but, more importantly, in the richness of the scenes and tasks they cover. With a VLA trained on human video that predicts human wrist and hand actions, we can perform inverse kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA on the Ego Humanoid Manipulation Benchmark, showing significant improvements over baselines, and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
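The conversion from predicted human wrist and hand actions to robot actions can be pictured with a toy retargeting step. The sketch below is an assumption-laden stand-in, not EgoVLA's pipeline: a thumb-index gap is mapped to a gripper command, and a single damped-least-squares IK step turns a 6-DoF pose error into joint increments for a 7-DoF arm.

# Sketch of hand-to-robot retargeting plus an IK step (illustrative only).
import numpy as np

def retarget_hand_to_gripper(finger_tip_gap_m, open_gap_m=0.10):
    """Map predicted thumb-index distance to a normalized gripper command in [0, 1]."""
    return float(np.clip(finger_tip_gap_m / open_gap_m, 0.0, 1.0))

def damped_least_squares_ik_step(jacobian, pose_error, damping=1e-2):
    """One DLS iteration: dq = J^T (J J^T + lambda^2 I)^-1 * pose_error."""
    jjt = jacobian @ jacobian.T
    reg = damping ** 2 * np.eye(jjt.shape[0])
    return jacobian.T @ np.linalg.solve(jjt + reg, pose_error)

# Usage sketch with placeholder numbers: a predicted 6-DoF wrist pose error
# and a 6x7 Jacobian for a 7-DoF arm.
jacobian = np.random.randn(6, 7)
pose_error = np.array([0.01, -0.02, 0.03, 0.0, 0.05, 0.0])
dq = damped_least_squares_ik_step(jacobian, pose_error)
grip = retarget_hand_to_gripper(finger_tip_gap_m=0.04)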
RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot
Heng, Liang, Li, Xiaoqi, Mao, Shangqing, Liu, Jiaming, Liu, Ruolin, Wei, Jingli, Wang, Yu-Kai, Jia, Yueru, Gu, Chenyang, Zhao, Rui, Zhang, Shanghang, Dong, Hao
Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robotic system and a skilled operator, limiting scalability, while the latter faces challenges in bridging the visual gap between human hand demonstrations and the deployed robot observations. To address this, we propose a human hand data collection system combined with our hand-to-gripper generative model, which translates human hand demonstrations into robot gripper demonstrations, effectively bridging the observation gap. Specifically, a GoPro fisheye camera is mounted on the human wrist to capture human hand demonstrations. We then train a generative model on a self-collected dataset of paired human hand and UMI gripper demonstrations, processed using a tailored data pre-processing strategy to ensure alignment in both timestamps and observations. Therefore, given only human hand demonstrations, we can automatically extract the corresponding SE(3) actions and integrate them with high-quality generated robot demonstrations through our generation pipeline to train a robot policy model. In experiments, the robust manipulation performance demonstrates not only the quality of the generated robot demonstrations but also the efficiency and practicality of our data collection method. More demonstrations can be found at: https://rwor.github.io/
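How the SE(3) actions are extracted is not spelled out in the abstract; one common convention, shown in the hedged sketch below (an assumption, not the authors' pipeline), is to take the per-step action as the relative transform between consecutive end-effector poses, expressed in the current end-effector frame.

# Sketch: relative SE(3) actions from consecutive end-effector poses (assumed convention).
import numpy as np

def se3_inverse(T):
    R, p = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ p
    return Tinv

def relative_actions(poses):
    """poses: list of 4x4 homogeneous transforms; returns per-step delta transforms."""
    return [se3_inverse(poses[t]) @ poses[t + 1] for t in range(len(poses) - 1)]

# Usage sketch with two dummy poses translated 5 cm along x.
T0, T1 = np.eye(4), np.eye(4)
T1[0, 3] = 0.05
print(relative_actions([T0, T1])[0][:3, 3])  # -> [0.05, 0, 0]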
Imitation Learning with Precisely Labeled Human Demonstrations
Within the imitation learning paradigm, training generalist robots requires large-scale datasets obtainable only through diverse curation. Because they are relatively easy to collect, human demonstrations constitute a valuable addition when incorporated appropriately. However, existing methods utilizing human demonstrations face challenges in inferring precise actions, ameliorating embodiment gaps, and fusing with frontier generalist robot training pipelines. In this work, building on prior studies that demonstrate the viability of using hand-held grippers for efficient data collection, we leverage the user's control over the gripper's appearance--specifically by assigning it a unique, easily segmentable color--to enable simple and reliable application of RANSAC and ICP registration for precise end-effector pose estimation. We show in simulation that precisely labeled human demonstrations on their own allow policies to reach, on average, 88.1% of the performance of policies trained on robot demonstrations, and that they boost policy performance when combined with robot demonstrations, despite the inherent embodiment gap.
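The color-segmentation plus RANSAC/ICP registration step lends itself to a short sketch. The following is a minimal illustration using Open3D, not the paper's exact pipeline: points matching the gripper's unique color are segmented from the RGB-D point cloud, and a gripper model is registered to them with ICP. A RANSAC-based global registration could supply the initial guess; identity is used here for brevity.

# Sketch of segmentation + ICP pose estimation (assumed pipeline, Open3D API).
import numpy as np
import open3d as o3d

def segment_by_color(points, colors, target_rgb=(1.0, 0.0, 0.0), tol=0.25):
    """Keep points whose RGB color lies within `tol` of the gripper's assigned color."""
    mask = np.linalg.norm(colors - np.asarray(target_rgb), axis=1) < tol
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points[mask])
    return pcd

def estimate_gripper_pose(scene_pcd, gripper_model_pcd, init_guess=np.eye(4)):
    """Point-to-point ICP from the gripper model to the segmented observation;
    returns a 4x4 end-effector pose estimate."""
    result = o3d.pipelines.registration.registration_icp(
        gripper_model_pcd, scene_pcd,
        max_correspondence_distance=0.01,
        init=init_guess,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )
    return result.transformation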
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Zhao, Qingqing, Lu, Yao, Kim, Moo Jin, Fu, Zipeng, Zhang, Zhuoyang, Wu, Yecheng, Li, Zhaoshuo, Ma, Qianli, Han, Song, Finn, Chelsea, Handa, Ankur, Liu, Ming-Yu, Xiang, Donglai, Wetzstein, Gordon, Lin, Tsung-Yi
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks; as a result, they lack temporal planning and reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
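The two-stage decoding described above (generate a future frame as a visual goal, then decode an action chunk) can be sketched as follows. The interface model(tokens) -> next-token logits, the token counts, and greedy decoding are assumptions, not CoT-VLA's actual implementation.

# Sketch of visual chain-of-thought decoding (assumed interface).
import torch

@torch.no_grad()
def visual_cot_rollout(model, obs_tokens, text_tokens,
                       n_image_tokens=256, n_action_tokens=56):
    """Stage 1 generates subgoal image tokens; stage 2 decodes an action chunk."""
    prefix = torch.cat([text_tokens, obs_tokens], dim=1)

    # Stage 1: autoregressively generate the visual chain-of-thought goal frame.
    goal_tokens = []
    for _ in range(n_image_tokens):
        logits = model(torch.cat([prefix] + goal_tokens, dim=1))[:, -1]
        goal_tokens.append(logits.argmax(dim=-1, keepdim=True))

    # Stage 2: decode a short action chunk conditioned on the generated goal.
    prefix = torch.cat([prefix] + goal_tokens, dim=1)
    action_tokens = []
    for _ in range(n_action_tokens):
        logits = model(torch.cat([prefix] + action_tokens, dim=1))[:, -1]
        action_tokens.append(logits.argmax(dim=-1, keepdim=True))
    return torch.cat(goal_tokens, dim=1), torch.cat(action_tokens, dim=1)

# Usage sketch with a stand-in "model" returning random logits over a toy vocab.
vocab = 1024
model = lambda toks: torch.randn(toks.shape[0], toks.shape[1], vocab)
obs_tokens = torch.randint(0, vocab, (1, 64))
text_tokens = torch.randint(0, vocab, (1, 16))
goal, actions = visual_cot_rollout(model, obs_tokens, text_tokens,
                                   n_image_tokens=8, n_action_tokens=8)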
Train Robots in a JIF: Joint Inverse and Forward Dynamics with Human and Robot Demonstrations
Khandate, Gagan, Wang, Boxuan, Park, Sarah, Ni, Weizhe, Palacious, Jaoquin, Lampo, Kate, Wu, Philippe, Ho, Rosh, Chang, Eric, Ciocarlie, Matei
Pre-training on large datasets of robot demonstrations is a powerful technique for learning diverse manipulation skills but is often limited by the high cost and complexity of collecting robot-centric data, especially for tasks requiring tactile feedback. This work addresses these challenges by introducing a novel method for pre-training with multi-modal human demonstrations. Our approach jointly learns inverse and forward dynamics to extract latent, manipulation-specific state representations. This enables efficient fine-tuning with only a small number of robot demonstrations, significantly improving data efficiency. Furthermore, our method allows for the use of multi-modal data, such as a combination of vision and touch for manipulation. By leveraging latent dynamics modeling and tactile sensing, this approach paves the way for scalable robot manipulation learning based on human demonstrations.
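A minimal sketch of joint inverse/forward dynamics pre-training is given below (architecture sizes, loss weights, and the stop-gradient on the forward target are assumptions, not the paper's design): an encoder maps multi-modal observations to a latent state, a forward model predicts the next latent from the current latent and action, and an inverse model predicts the action from consecutive latents.

# Sketch of a joint inverse/forward dynamics objective (illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDynamics(nn.Module):
    def __init__(self, obs_dim=64, action_dim=7, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + action_dim, 128),
                                           nn.ReLU(), nn.Linear(128, latent_dim))
        self.inverse_model = nn.Sequential(nn.Linear(2 * latent_dim, 128),
                                           nn.ReLU(), nn.Linear(128, action_dim))

    def loss(self, obs, next_obs, action, w_fwd=1.0, w_inv=1.0):
        z, z_next = self.encoder(obs), self.encoder(next_obs)
        fwd_loss = F.mse_loss(self.forward_model(torch.cat([z, action], -1)),
                              z_next.detach())
        inv_loss = F.mse_loss(self.inverse_model(torch.cat([z, z_next], -1)), action)
        return w_fwd * fwd_loss + w_inv * inv_loss

# Usage sketch on random "vision + touch" features stacked into one vector.
model = JointDynamics()
obs, next_obs = torch.randn(16, 64), torch.randn(16, 64)
action = torch.randn(16, 7)
print(model.loss(obs, next_obs, action).item())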